Add multiple name-similarity algorithms to threat actor report tool#1173

Merged
adulau merged 1 commit into main from
codex/update-threat_actor_similarity_report.py-with-algorithms
Apr 12, 2026

adulau commented Apr 12, 2026

Motivation

  • Improve detection of potentially duplicate or similar threat-actor names by adding additional comparison methods beyond the existing SequenceMatcher approach.
  • Allow analysts to combine results from different similarity strategies and surface only pairs that are agreed on by multiple algorithms when desired.

Description

  • Add four similarity algorithms, selected through a dispatcher (get_similarity): sequence (the existing difflib.SequenceMatcher), levenshtein (normalized edit distance), compression (an NCD-inspired compression approximation of Kolmogorov-complexity-style similarity), and vector (cosine similarity over character n-gram count vectors).
  • Add CLI options --algorithms (comma-separated list or all) and --combine-mode (union or intersection) to control which algorithms are used and whether to include pairs matched by any algorithm or only those common to all selected algorithms.
  • Include per-algorithm scores in outputs and compute an aggregate score for sorting; update markdown and JSON output formats and report metadata to show selected algorithms and combine mode.
  • Refactor pair filtering and scoring into helper functions (should_compare_pair, individual similarity functions, and find_similar_name_pairs) to support multi-algorithm evaluation and aggregation.
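
The four algorithms and the dispatcher described above could be sketched roughly as follows. This is an illustrative sketch, not the merged code: only get_similarity is named in this description, so its signature, the other function names, the ALGORITHMS table, the use of zlib, and the bigram size with space padding are all assumptions.

```python
import zlib
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def sequence_similarity(a: str, b: str) -> float:
    """Ratio from difflib.SequenceMatcher, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance / max(len), so identical strings score 1.0."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def compression_similarity(a: str, b: str) -> float:
    """1 - NCD, where NCD uses zlib output length as a complexity proxy.

    Assumption: zlib is the compressor. On very short names the fixed
    zlib overhead dominates, so scores are only a rough signal.
    """
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    ncd = (cab - min(ca, cb)) / max(ca, cb)
    return max(0.0, min(1.0, 1.0 - ncd))  # clamp into [0, 1]


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Character n-gram counts; the name is padded with one space per side."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def vector_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine similarity between character n-gram count vectors."""
    va, vb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical lookup table mapping --algorithms values to functions.
ALGORITHMS = {
    "sequence": sequence_similarity,
    "levenshtein": levenshtein_similarity,
    "compression": compression_similarity,
    "vector": vector_similarity,
}


def get_similarity(algorithm: str, a: str, b: str) -> float:
    """Dispatch to the named algorithm; unknown names raise ValueError."""
    try:
        func = ALGORITHMS[algorithm]
    except KeyError:
        raise ValueError(f"unknown algorithm: {algorithm!r}") from None
    return func(a, b)
```

Under these assumptions, get_similarity("levenshtein", "kitten", "sitting") evaluates to 1 - 3/7, since the edit distance between the two strings is 3 and the longer string has 7 characters.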

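The difference between the two --combine-mode values can be illustrated with a small sketch that works on already-computed per-algorithm scores. The function names here are hypothetical, and using the mean as the aggregate score is an assumption, since this description does not say how the aggregate is computed.

```python
from statistics import mean


def pair_selected(scores: dict[str, float], threshold: float, combine_mode: str) -> bool:
    """Decide whether a name pair is reported.

    union:        at least one algorithm reaches the threshold.
    intersection: every selected algorithm reaches the threshold.
    """
    hits = [score >= threshold for score in scores.values()]
    if combine_mode == "union":
        return any(hits)
    if combine_mode == "intersection":
        # Guard against an empty score dict, for which all() would be True.
        return bool(hits) and all(hits)
    raise ValueError(f"unknown combine mode: {combine_mode!r}")


def aggregate_score(scores: dict[str, float]) -> float:
    """Assumed aggregate used for sorting: the plain mean of the scores."""
    return mean(scores.values())


# Made-up scores for a single pair, for illustration only.
example = {"sequence": 0.62, "levenshtein": 0.48, "vector": 0.71}
print(pair_selected(example, 0.5, "union"))         # True: two algorithms pass
print(pair_selected(example, 0.5, "intersection"))  # False: levenshtein is below 0.5
```

Intersection mode therefore trades recall for precision: a pair is dropped as soon as one selected algorithm disagrees.
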
Testing

  • Ran python3 tools/threat_actor_similarity_report.py --help and confirmed that the new options are listed and that the command exits successfully.
  • Ran the tool on a small test cluster with --algorithms sequence,levenshtein,vector --combine-mode union --threshold 0.5; it completed successfully and produced markdown and JSON output reporting Analyzed normalized names: 5 and Potential similar name pairs: 3.
  • Attempted a larger run on the full dataset with --algorithms sequence --max-results 5; it hit the environment timeout because of the pairwise comparison cost, which indicates that large clusters may need longer runs.
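
The timeout is consistent with how all-pairs comparison scales: n names yield n(n-1)/2 candidate pairs before any filtering. A minimal sketch of that upper bound follows; the function name is hypothetical, and filters such as should_compare_pair may reduce the real count.

```python
from math import comb


def max_pair_count(name_count: int) -> int:
    """Upper bound on unordered name pairs to compare: n choose 2."""
    return comb(name_count, 2)


for n in (5, 1_000, 10_000):
    print(n, max_pair_count(n))
# 5 10
# 1000 499500
# 10000 49995000
```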

Codex Task

adulau merged commit 26c7eb2 into main Apr 12, 2026
2 checks passed